Unit 4 - Linear Regression with Scikit-Learn

Global Population & GDP Analysis

This tutorial applies linear regression to investigate the relationship between a country's population and its GDP per capita. The analysis uses Python with pandas, numpy, matplotlib, scikit-learn, and scipy.

Data Loading and Preprocessing

Loading and Preparing the Data

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

# Load the datasets
pop_df = pd.read_csv('Global_Population.csv')
gdp_df = pd.read_csv('Global_GDP.csv')

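# Year columns as strings. range(2001, 2022) yields '2001' through '2021';
# adjust the bounds if the study period differs (e.g. range(2000, 2021) for 2000-2020).
years = [str(y) for y in range(2001, 2022)]

# Coerce the year columns to numeric; non-numeric entries become NaN
pop_df[years] = pop_df[years].apply(pd.to_numeric, errors='coerce')
gdp_df[years] = gdp_df[years].apply(pd.to_numeric, errors='coerce')

# Fill each row's missing years with that row's own mean across the period
# (rows with no valid values at all remain NaN)
pop_df[years] = pop_df[years].apply(lambda row: row.fillna(row.mean()), axis=1)
gdp_df[years] = gdp_df[years].apply(lambda row: row.fillna(row.mean()), axis=1)

# Average each row over the period to get one value per country or region
pop_df['mean_population'] = pop_df[years].mean(axis=1)
gdp_df['mean_gdp'] = gdp_df[years].mean(axis=1)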
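
Before the two means can be compared, they need to sit side by side in one table. The tutorial does not show this step, so the following is only a minimal sketch of one way it might be done. The key column name "Country Name" and the tiny stand-in frames are assumptions for illustration, not taken from the actual CSV files.

```python
import pandas as pd

# Stand-in frames; "Country Name" is an assumed key column (hypothetical)
pop_df = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "mean_population": [1.0e6, 5.0e6, None],  # C has no valid population data
})
gdp_df = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "mean_gdp": [2.0e9, 9.0e9, 3.0e9, 4.0e9],
})

# Inner join keeps only entries present in both tables (D is dropped)
merged = pd.merge(
    pop_df[["Country Name", "mean_population"]],
    gdp_df[["Country Name", "mean_gdp"]],
    on="Country Name",
    how="inner",
)

# Remove rows where either mean is still missing (C is dropped)
merged = merged.dropna(subset=["mean_population", "mean_gdp"]).reset_index(drop=True)
print(merged)
```

An inner join is used here so that every remaining row has both a population and a GDP value; a different join type may be more appropriate if the two files are known to list identical entries.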

Correlation Analysis

Figure (Correlation Plot): Correlation between Population and GDP per Capita
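
The correlation step itself is not shown in the tutorial. Since `pearsonr` is imported at the top, a plausible sketch of the computation looks like the following. The synthetic arrays stand in for the mean population and mean GDP columns and are assumptions made purely so the example runs on its own.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data (hypothetical), seeded for reproducibility
rng = np.random.default_rng(0)
population = rng.uniform(1e5, 1e8, size=50)
gdp = 6000.0 * population + rng.normal(0, 5e10, size=50)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(population, gdp)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```

The coefficient measures only linear association, so it is worth inspecting the scatter plot as well before drawing conclusions from `r` alone.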

Regression Analysis

Figure (Regression Plot): Linear Regression Analysis
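
The fitting code is also not shown. A minimal sketch using scikit-learn's `LinearRegression`, which the tutorial imports, might look like this. The toy arrays `x` and `y` are assumptions standing in for mean population and mean GDP.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data (hypothetical): roughly y = 2x + 1 with small noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# scikit-learn expects a 2D feature matrix of shape (n_samples, n_features)
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
slope = model.coef_[0]          # change in y per unit change in x
intercept = model.intercept_    # predicted y at x = 0
r_squared = model.score(X, y)   # coefficient of determination on the training data

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```

Note that `score` here is evaluated on the same data used for fitting, so it describes goodness of fit rather than predictive performance on unseen data.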

Key Findings

  • The analysis included data from 258 countries and regions over the period 2000-2020.
  • The Pearson correlation coefficient between mean population and mean GDP is 0.72, indicating a strong positive linear relationship.
  • The p-value of 3.46 × 10⁻⁴² indicates that this correlation is statistically significant.
  • The regression line shows a moderate fit to the data, with an R² of 0.52, meaning that approximately 52% of the variation in mean GDP is explained by mean population.
  • The regression coefficient is 5909.31, meaning that each additional person in the mean population is associated with an increase of approximately 5,909 USD in mean GDP. This is an association, not evidence of a causal effect.
  • The model has an intercept of 4.39 × 10¹¹, the predicted mean GDP at a population of zero. This is an extrapolation outside the observed data and has no practical interpretation.
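
As a sanity check on the figures above: in a simple linear regression with one predictor, R² equals the square of the Pearson coefficient, so 0.72² ≈ 0.52 is consistent with the reported fit. The sketch below checks that arithmetic, verifies the identity on toy data (the arrays are assumptions for illustration), and evaluates the reported regression line for a hypothetical population of 10 million.

```python
import numpy as np

# Consistency check on the reported values
r_reported = 0.72
print(round(r_reported ** 2, 2))  # 0.52

# Verify R^2 == r^2 on toy data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.5, 5.0, 8.5, 9.0, 13.0])

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)       # least-squares line
residuals = y - (slope * x + intercept)
ss_res = np.sum(residuals ** 2)              # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

print(f"r^2 = {r ** 2:.6f}, R^2 = {r_squared:.6f}")

# Prediction from the reported coefficients for a hypothetical population
reported_slope = 5909.31
reported_intercept = 4.39e11
hypothetical_population = 10_000_000
predicted = reported_intercept + reported_slope * hypothetical_population
print(f"predicted mean GDP = {predicted:.3e}")
```

For a population of this size the intercept accounts for most of the predicted value, which is one reason to treat predictions for small populations with caution.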